-
Notifications
You must be signed in to change notification settings - Fork 322
Bulk http status & 429 retries #1868
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
@elasticmachine update branch |
|
My approach so far seems to work - Please take a look |
|
@elasticmachine update branch |
|
The failures are due to me changing the parameters for detailed_stats - https://github.com/elastic/rally-tracks/blob/master/elastic/shared/runners/bulk.py#L36C25-L36C39 I guess I can either thoughts? |
gbanasiak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting coupling problem with super().detailed_stats(...) used in elastic/logs. Tracks version control is only for ES, not for Rally, so changing custom detailed_stats() say like below will fail when calling parent's method in old Rally.
def detailed_stats(self, params, response, emit_lines_to_retry=False):
stats = super().detailed_stats(params, response, emit_lines_to_retry) <--- HERE
return {**stats, **params["param-source-stats"]}
It would be best if we minimized the surface between tracks code and rally code. Our docs only mention top-level runner call with es and params arguments. Due to this I would be happy to declare that what this particular runner is doing as not supported. I would even go as far as clarify this in documentation.
My vote would be to change elastic/logs custom runner to avoid calling detailed_stats() completely and backport this to all branches where modified Rally might potentially be used, so I'd say all 8.x up until now.
Something like this perhaps?
class RawBulkIndex(BulkIndex):
async def __call__(self, es, params):
meta_data = await super().__call__(es, params)
if params.get("detailed-results", False):
meta_data.update(params["param-source-stats"])
return meta_data
The alternative would be to keep detailed_results() as is, and determine documents to retry separately but that means iterating through a response twice which would impact processing time.
I'm curious about @fressi-elastic thoughts on that one as well.
I'm adding further comments below.
I think we also planned on adding exponential back-off. Is this still a plan, or not in this PR?
| break | ||
| self.logger.warning("Retrying %d documents that previously resulted in a 429.", len(lines_to_retry) / 2) | ||
| api_kwargs["body"] = lines_to_retry | ||
| bulk_size = len(lines_to_retry) / 2 # at this point the data always contains action metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at this point the data always contains action metadata
This is a slight off-topic:
I've spent some time digging once I saw this comment. I think bulk runner always receives action-and-metadata lines in the body param today (see here). If corpus does not include them they are generated in earlier processing stages. I don't quite understand this from the above code:
if with_action_metadata:
api_kwargs.pop("index", None)
# only half of the lines are documents
response = await es.bulk(params=bulk_params, **api_kwargs)
else:
response = await es.bulk(doc_type=params.get("type"), params=bulk_params, **api_kwargs)
The only half of the lines are documents comment suggests the else clause is different, but it isn't. There's nothing in bulk() method of ES client that would magically add action-and-metadata lines. Also doc_type is ignored I think, it's a leftover from old ES versions.
I think we could simplify / remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea - it seems at one point we calculated the number of documents here, but that was removed and the comment not removed. I'll remove the comment for now, I think i would need to test a little more whether we have code paths that use the with_action_metadata or not
| if detailed_results: | ||
| stats = { | ||
| "success-count": total_success, | ||
| "error-count": total_error, | ||
| "retry-count": retry_count, | ||
| "took": total_time, | ||
| "success": len(lines_to_retry) == 0, | ||
| "retried": retry_count > 0, | ||
| "bulk-request-size-bytes": sum_bulk_request_size_bytes, | ||
| "total-document-size-bytes": sum_total_document_size_bytes, | ||
| "ops": {}, # detailed per-op stats are not aggregated over retries | ||
| "shards_histogram": [], # detailed per-shard stats are not aggregated over retries | ||
| } | ||
| else: | ||
| stats = { | ||
| "success-count": total_success, | ||
| "error-count": total_error, | ||
| "retry-count": retry_count, | ||
| "took": total_time, | ||
| "success": len(lines_to_retry) == 0, | ||
| "retried": retry_count > 0, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This exposes the details of stats at _call__() level which were previously hidden either in simple_stats() or detailed_stats(). Can we avoid this? We could have a method that iterates through response documents, and:
- calls another method (passed as an argument) that increases stats counters for each document,
- builds a retry list (optionally).
Future direction: